Block-Structured Adaptive Mesh Refinement Algorithms for 

Vlasov Simulation 

J. A. F. Hittinger and J. W. Banks 
February 5, 2013 

Abstract 

Direct discretization of continuum kinetic equations, like the Vlasov equation, are 
under-utilized because the distribution function generally exists in a high-dimensional 
(>3D) space and computational cost increases geometrically with dimension. We pro- 
pose to use high-order finite-volume techniques with block-structured adaptive mesh 
refinement (AMR) to reduce the computational cost. The primary complication comes 
from a solution state comprised of variables of different dimensions. We develop the 
algorithms required to extend standard single-dimension block structured AMR to the 
multi-dimension case. Specifically, algorithms for reduction and injection operations 
that transfer data between mesh hierarchies of different dimensions are explained in 
detail. In addition, modifications to the basic AMR algorithm that enable the use of 
high-order spatial and temporal discretizations are discussed. Preliminary results for 
a standard 1D+1V Vlasov-Poisson test problem are presented. Results indicate that 
there is potential for significant savings for some classes of Vlasov problems. 



1 Introduction 

The Vlasov-Maxwell system of equations is a fundamental kinetic model that describes 
weakly-coupled plasma dynamics. The Vlasov equation is a partial differential equation 
in phase space, (x,v) G R N xR M for N,M G [1,2,3] such that M > N, that describes 
the evolution in time, t G M+, of a particle distribution function, f(x,v,t) G M+, in the 
presence of electromagnetic fields, E(x, t) G and B(x, t) G M. N . For a particle species 
a, the Vlasov equation is 

^ + v • V x f a l^(ElvxB). V v / Q = 0, (1) 
at m a 

where the particle charge and mass are q a and m a , respectively. Both imposed and self- 
generated electric and magnetic fields, E and B, respectively, are responsible for the Lorentz 
force in ([!]) and are the solutions of Maxwell's equations (or a specialization thereof): 
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VxE+-— = 0, (2a) 
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V • B = 0, (2d) 
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where eo is the permittivity of free space and c is the in vacuo speed of light. The total 
charge density, p, and the total current, j, are the sums over contributions from all species 
a, 

PCM) = V]a*( x >*) = y^<?a / fad\, (3a) 

a a d 

j( x ,*) = y^Ja(x,t) = y^Qa / vf a dv, (3b) 

and these moments of the distribution function nonlinearly couple Maxwell's equations to 
the Vlasov equation. 

The Vlasov-Maxwell system and related models are fundamental non-equilibrium de- 
scriptions of plasma dynamics. Non-equilibrium kinetic effects in plasmas play a crucial 
role in fusion applications. Understanding and controlling wave-particle interactions is im- 
portant to the success of inertial confinement fusion, where resulting resonant responses can 
interfere with the intended deposition of laser energy [Tj. In magnetic confinement fusion, 
gyrokinetic models, which are a reduced form of the Vlasov equations [2H1], are used to 
better understand the physical mechanisms controlling the core conditions, in particular 
micro-turbulence, which is at the origin of the so-called anomalous transport [5]. 

The Vlasov model also has applicability beyond fusion plasmas. Collisionless shocks in 
astrophysics, which are thought to be driven by electrostatic and electromagnetic instabil- 
ities [6HS], can be accurately modeled by the Vlasov-Maxwell system. The Vlasov-Poisson 



system, where Gauss' Law (2c) is sufficient to describe the relationship between the elec- 
trostatic field and the charge density, is being used in particle beam accelerator design [9]. 
Laser isotope separation is another application area for Vlasov-Maxwell models [10J. 

While these kinetic models may have great relevance, their numerical approximation for 
problems of interest have been constrained primarily by computational cost. For N = 3, 
the distribution functions in the full Vlasov model have a phase-space domain of six dimen- 
sions. Directly discretizing phase space, an approach alternatively referred to as grid-based, 
Eulerian, or continuum methods, incurs a computational cost that scales geometrically 
with the number of dimensions. Thus, while for over forty years work has been done on 
the continuum numerical discretization of the Vlasov equation |llH31j . continuum Vlasov 
methods have been applied primarily to lower-dimensional - so-called 1D+1V and 1D+2V 
- problems. Application of continuum Vlasov to four dimensions (2D+2V) and above has 
been limited |30H33] . In contrast, the particle-based particle- in-cell (PIC) [35] method has 
dominated kinetic Vlasov simulation. PIC methods use Monte-Carlo sampling techniques in 
velocity space to reduce the high-dimensional cost and evolve "clouds" of particles through 
a Lagrangian form of the Vlasov equation. Maxwell's equations, however, are solved on an 
overlaid computational mesh (hence, "in cell"). While this approach is generally less ex- 
pensive than continuum Vlasov discretization, PIC results contain inherent statistical noise 
that generally vanishes only as the square root of the number of particles. 

As computer speed and memory have increased, direct discretization of the Vlasov- 
Maxwell system has become more feasible, but hardware improvements alone are insuffi- 
cient to make full-phase-space, continuum Vlasov codes practical. However, as with PIC 
methods, tremendous savings could be realized if continuum approaches could reduce the 
number of cells used to represent phase space. One means to this end is to employ adap- 
tive mesh refinement and to resolve only those regions of phase space of greatest variation 
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or importance. For instance, block-structured adaptive mesh refinement (AMR) in phase 
space could concentrate cells in the vicinity of localized structure, such as particle trapping 
regions. In addition, flux-based explicit Eulerian schemes have time-step restrictions that, 
for Vlasov, are typically dominated by the maximum particle velocity limits of the phase- 
space domain. AMR allows for high-aspect ratio cells in these regions, which can result in 
significant increases in time step size without a loss of accuracy, since little particle density 
or variation is present at these extreme velocity boundaries. 

Adaptive mesh refinement has a limited history in Vlasov simulation. AMR has been 
used with PIC methods in the simulation of heavy- ion accelerators [9\ [35] . Recent work in 
this area uses a wavelet-based approach |23l [25] , where the semi-Lagrangian interpolation 
is based upon a multi-level wavelet basis and where the local depth of the wavelet hierarchy 
is used to increase or decrease the local mesh refinement. This approach generates a near- 
optimal grid, but progress in this direction seems to have stalled. It may be the case 
that the less regular grid structure may introduce other complications, for example, in the 
construction of moments, that make this approach less attractive. 

In this paper, we present a block-structured adaptive mesh refinement approach suitable 
for the continuum discretization of the Vlasov-Maxwell system. As a proof-of-concept, we 
demonstrate the ideas and techniques in the context of a simpler system, the Vlasov-Poisson 
model, which is presented in Section [2] along with the basic flux-based Eulerian discretiza- 
tion we employ. Thus, we will not address the control of electromagnetic wave reflections at 
coarse-fine interfaces; methods to minimize such reflections are addressed elsewhere in the 
literature [351 [36]. in Section [3j we discuss the block-structured AMR strategy, its benefits, 
and the challenges presented by Vlasov problems. We specifically address a subset of these 
issues that we have resolved in order to demonstrate a successful Vlasov- AMR implemen- 
tation. Sample calculations are presented in Section |4j and we conclude with a discussion 
of the future algorithmic advances that will allow additional gains from AMR applied to 
Vlasov simulation. 



2 Model Problem and Discretization 

Both for our purposes here as well as for many physically interesting problems, the Vlasov- 
Maxwell system can be significantly simplified by assuming an electrostatic limit with sta- 
tionary ions. The electrostatic limit corresponds to an assumption of small magnetic field 
strength. The assumption of stationary ions is appropriate when the ion time scales are 
large compared to that of the electrons, which is typically the case. With these assumptions, 
the Vlasov equation ([!]) for the electron probability density function / becomes 

df 

+ v • V x / - E • V v / = 0, (4) 
under a suitable nondimensionalization. Here E is the electric field, x is the physical space 



and v is the velocity. In the electrostatic limit, only Gauss' law (2c) is relevant. Representing 
the electric field in terms of an electrostatic potential, E = V x 0, Gauss' law becomes the 
Poisson equation: 

Vl^ = p e - Pi = [ fdv-1. (5) 



Here the constant, unit background charge, pi = 1, is the manifestation of the immobile ion 
(proton) assumption. 
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For the purposes of discussing the new adaptive discretization algorithms, we make one 
final simplifying assumption of a so-called 1D+1V phase space (i.e., one spatial dimension 
and one velocity dimension). The final system of governing equations is thus succinctly 
written as 

d A + v d A + d ± d A = Q (6a) 

dt dx dx dv ' 
d 2 6 f°° 

Note that, as characteristic of all Vlasov-Maxwell-type systems, we have a higher-dimensional 
variable, f(x,v), coupled to a lower-dimensional variable, cj>(x). 

In order to discretize the model Vlasov-Poisson system, we restrict our attention to a 
finite domain. For the physical coordinate we let x G [-L, L] and apply periodic boundary 
conditions. Other boundary conditions are also possible, but for the initial-value problems 
considered here, a periodic condition is appropriate. For the velocity coordinate, we truncate 
the domain and consider v G [iVnin' ^max]- This introduces an artificial boundary where we 
apply a characteristic boundary condition. Outgoing characteristics are extrapolated and 
incoming characteristics carry values from an unperturbed Maxwellian distribution. 

Our discretization follows the Eulerian finite- volume formulation developed in \37\ [38] . 



The ID Vlasov equation (6a) is rewritten in flux-divergence form as 



Phase space is divided into cells using a Cartesian grid with mesh spacings Ax and Av 
in the x- and w-dimensions respectively. Integrating over a single computational cell and 
dividing by the volume AxAv, we obtain the exact system of ordinary differential equations 

where the cell average is defined as 

^^AxAvf fdxdv - 
As in (37J EH]; the angle bracket notation is used to indicate face averages, for example, 



1 

(/) *+§j = aW f(x i+ i/2,v)dv. 



The face-averaged fluxes are approximated to fourth order as 

(Vf) i+ y « V 3 (f) i+hj + ((/),+ " (f) i+ y-l) , 

(Ef) i>j+k « m) i>j+ i - ^ (E i+ i - ^-i) (</> mj+ i - (f)i-i, j+ i) ■ 

Notice that, because v is only a function of v and E is only a function of x, the notion of 
a face average is redundant, and the angle bracket is replaced by an an overbar. For more 
details concerning this high-order finite- volume formalism, refer to |37} 138]. 
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The quantity Vj is directly computed as an exact cell average (recall that v is an in- 
dependent variable). The cell-averaged electric field is computed from a potential 4>; to 
fourth-order, this is 

The cell-averaged potential is obtained by solving a discretization of the Poisson equa- 



tion (6b) : 

300; - 16(^+1 + 4>i-l) + (4>i+2 + 4>i-2) = UAxpi, (9) 

where 

"max 

Pi = 1-Av ^ fij- 

3 — V max 

This discretization leads to a linear system with a nearly pentadiagonal matrix (boundary 
conditions slightly alter the pentadiagonal structure). 

For reasons explained in Section [3j the Poisson problem is always represented on the 
finest mesh in configuration space, and so the resulting linear algebra problem can be LU- 
decomposed once for each level of refinement and stored. Periodic boundary conditions 
in x lead to a singular system, which is a well-known problem that is easily addressed by 
projecting out the portion of p(x) residing in the null space of the matrix. This amounts 
to ensuring that P( x i) = Oj an< ^ m so doing, we ensure that 4>(x) is normalized around 
zero. Of course, since we take a derivative of <j)(x) to get E(x), the offset has no effect on 
the solution. 

To complete the description of the discretization, a procedure to derive face averages 
from cell averages must be identified. We use the scheme developed in [32^ I39j . which has 
the property that, for well-represented solutions, a fourth-order centered approximation is 
used. As solution features become sharp on a given mesh, upwind numerical dissipation 
is introduced to smooth out those features consistently. The scheme is described in detail 
in \32\ I39j . but we provide a brief overview here as well. 

We focus on the determination of the face average (f) i+ i other averages follow similar 
derivations. The scheme has many similarities to the popular WENO [40J method and uses 
many of the tools developed in the literature on that topic. The face average is constructed 
as a weighted sum of two third order approximations: 

(f)i+i tj ~ w i+ i_ tj>L (f) i+ i jL + w i+ i^ R (f) t+ i JR , (10) 

with 



and 



(f)i+y,L ~ I ( /< 1., + + ->./•, • ;.,) (11) 

(f) i+ y, R « I [Vij + s/i+ij - /i+2,i) • (12) 

Here the "L" and "R" indicate left- and right-biased, third-order approximations. With 
ideal weighting, w i+ i ■ L = w i+ i ■ R = |, equation (10) becomes the centered, fourth-order 



approximation. Using the standard WENO methodology, provisional weights, w-,i ■ T and 
a d, are determined. To maximize the upwind diffusion in the final numerical method, 
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we assign the larger weight to the upwind, third-order approximation and the smaller weight 
for the downwind, third-order stencil. Thus the final weights are determined as 



if ( Vj > 0) , 



else 



w i+y,R 



mm(w i+ i ijL ,w i+ ^ jtR ), 
m&x(w i+ i jL ,w i+ i jR ). 



(13) 



Note that, as with traditional WENO schemes, convergence rates near certain types of 
critical points (points with many zero derivatives) may be less than optimal. Additional 
modifications to the provisional weights can be made to alleviate this deficiency [41] . 

For the temporal discretization of the semi-discrete Vlasov equation Q, any stable 
method can be used. We choose the standard explicit fourth-order Runge-Kutta scheme. 
At each stage in the Runge-Kutta update, we solve the discrete potential equation|9]) prior 
to evaluating the phase-space flux divergence as given by the right-hand side of Q7 

Consider the ODE initial value problem 



df 



L(J,t), 



dt 

/(0) = /„■ 

The RK4 discretization for the ODE between time level n and n + 1 is 

l 



f n+1 = f n + At^bsh, 



s=l 



k s = L^ s \t n + c s At) , 
/W = f n + a s Atk s -x, 



(14a) 
(14b) 



(15a) 

(15b) 
(15c) 



with a = [0,1/2,1/2,1], b = [1/6,1/3,1/3,1/6], and c = [0,1/2,1/2,1]. Acknowledging 
that the operator L is, in our case, of flux- divergence form, we can write, for example, in 
one dimension, 



r 
r 



s=l 

1 



At 
At 



At^6 s [F l+1/2 (>) 
=i 

£ww» (f (s} 

F A 



where F- 



j+l/2 



[-^i+l/2 '■ / — J 2 
are accumulated interface fluxes. 



l 



" ( /<s) ) 



=i 



(16a) 

(16b) 

(16c) 
(16d) 
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Figure 1: An example of a three-level, block-structured AMR hierarchy. On the left, the 
composite refined grid Qc is shown. On the right, the corresponding mesh hierarchy Qh 
with overlapping patches is shown. All patches on the same level have the same refinement 
ratio relative to the coarsest level; in this case, the refinement ratios are two and four for the 
intermediate and finest levels, respectively. Note that each level is comprised of a collection 
of patches completely contained within the patches of the next coarser level. 



3 Block Structured AMR Algorithms 

Block-structured adaptive mesh refinement \A2\ [43] is a natural fit for certain Vlasov- 
Maxwell problems. Frequently, important fine-scale features in phase space, which could 
substantially benefit from higher resolution, only occupy limited regions in phase space. In 
contrast to the semi-structured, octree-based grids that were used in the earlier Vlasov- 
AMR work [21] , hierarchical block-structured AMR is based upon rectangular grid patches 
at different refinement levels in a global Cartesian index space, as shown in Figure [TJ Using 
a local error estimate or some detection of rapid variation in the solution to identify regions 
of interest, cells that should be refined are tagged. Tagged cells are grouped and expanded 
minimally to form rectangular patches that are inserted into the next level in the hierarchy. 
Slightly larger refinement regions can be used to reduce the frequency of regridding. The 
refinement process can be repeated recursively to form a hierarchy of refinement levels, each 
composed of multiple patches. 

Connectivity information is kept to a minimum in this scheme because the global Carte- 
sian index space provides a simple mechanism by which to identify the relationships between 
patches and levels. Within a level, patches contiguous in indices are neighboring, and across 
levels, the same is true, after adjusting by the net refinement ratio between the two levels. 
In general, for explicit methods, communication between patches is accomplished through 
ghost cells. As an additional savings, by maintaining a consistent solution on all patches, 
even those covered by patches on a finer level, time refinement algorithms that allow for 
nested subcycling on finer levels can be devised. 

Despite all of the previous work on block-structured AMR, applying the technique to 
Vlasov simulation introduces several new challenges. First and foremost, at least two mesh 
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hierarchies must be maintained: one in the M. configuration space and one in the R x M 
phase space. Different kinetic species will, in general, have different masses and temper- 
atures; the bulk of the corresponding particles will therefore occupy different ranges of 
particle velocity, and the structures arising from resonant responses will occur in different 
regions of phase space. Thus, each kinetic species should have its own mesh hierarchy. Thus, 
new algorithms for the simultaneous advancement and coordination of multiple hierarchies 
are required, and more importantly, efficient algorithms to enable communication between 
the hierarchies are required. From a parallel implementation perspective, a hierarchy for 
each species also allows for increased task parallelism when each hierarchy is assigned to a 
subset of processors; with no collisions, kinetic species only communicate through the lower- 
dimensional configuration space, so, given the electromagnetic state, high-dimensional flux 
computations and updates can naturally be done in parallel. 

In this paper, it is our goal to demonstrate solutions to the fundamental issues that 
must be addressed to make effective use of AMR in Vlasov simulation. Specifically, we will 
discuss: 

• Basic modifications due to discretization. Using high-order finite volume and 
a high-order multi-stage schemes departs somewhat from the standard, nominally 
second-order block-structured AMR approach. We describe the modified algorithms 
we use, for example, the intra-hierarchy interpolation operations and the synchronous 
time integration algorithm. 

• Inter-hierarchy transfer operations. The coupling of problems of different di- 
mension and their representation on separate hierarchies necessitates the creation of 
inter-hierarchy reduction and injection transfer algorithms. We discuss algorithms 
that achieve this efficiently. 

• Regridding for multiple AMR hierarchies. Regridding hierarchies, when multi- 
ple related hierarchies are present, requires additional constructs for coordination. 

In the following subsections, we address each of these areas, describing in more details 
the issues involved and explaining our solution approach. We will not specifically address 
efficient parallel decomposition strategies in this work. 

These new AMR algorithms, combined with the high-order discretizations presented in 
Section [2] have been implemented in the Vlasov code Valhalla^] This code makes use of 
the block-structured AMR library S AMRAI (Hj , which has been used in prior plasma- fluid 
simulations |45| . S AMRAI is capable of handling dimensions above three as well as the 
simultaneous existence of multiple and lower hierarchies, possibly of different dimension. A 
graph-based, distributed implementation of mesh metadata is employed within SAMRAI to 
provide excellent scaling to tens of thousands of processors. SAMRAI also provides fairly 
sophisticated, high-level AMR algorithm abstractions, but the simultaneous advancement of 
multiple related hierarchies, as required by Vlasov-Poisson, does not fit into these integration 
strategies and has thus required substantial additional development. 

3.1 Basic modifications due to discretization 

Our base discretization uses a method-of-lines approach, where spatial operators are first 
discretized using a nominally fourth-order spatial discretization and then the resulting semi- 

1 Vlasov Adaptive Limited High-order Algorithms for Laser Applications 
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discrete system is integrated using the standard four-stage, fourth-order explicit Runge- 
Kutta method. Fortunately, for a high-order finite-volume implementation, the restriction 
algorithm to obtain a coarse cell average from fine cell averages remains the simple summa- 
tion used for lower-order schemes. 



3.1.1 Synchronous, multi-stage time advancement algorithm 

In practice, an asynchronous process with time step subcycling on finer cells is typically 
used for explicit, space-time discretizations [12], but we chose to start with a synchronous 
update for simplicity. For a synchronized update (i.e., a single At for all levels), the 



RK4 algorithm (15)-(16) for a conservation law on a single-hierarchy is summarized in 
Algorithm [TJ Looping over stages, the predictor states and fluxes are computed, and the 
fluxes are accumulated. The predictor-state algorithm is laid out in Algorithm [2j and the 
flux divergence and accumulation algorithm is sketched in Algorithm [3j We note that, for 



this conservative form, we accumulate a flux variable as in (16) so that we can construct 
a flux divergence using a temporally fourth-order flux for the final update. Such flux 
accumulation eliminates an explicit re-fluxing step, since the final update can be done from 
finest to coarsest levels, and the accumulated flux can be averaged down so that a single 
update using the highest-quality flux can be done on each level. The update is computed 
from the accumulated fluxes as shown in Algorithm [4j and regridding is done if a user- 
defined number of time steps have elapsed. 



Algorithm 1 Multi-Level, Single-Hierarchy Flux-Divergence RK4 Advance 
k <- 

for all Stages s <— 1, 4 do 

COMPUTEPREDICTOR.STATE(fc, S, fpred) 

CoMPUTERHS(/ pred , s, k, F accum ) 
end for 

COMPUTEUPDATE(F accnm , f new ) 

if time to regrid then 

Regrid all levels 
end if 

Compute next At 



Algorithm 2 Multi-Stage Predictor State Computation 
procedure ComputePredictorState(£;, s, f we d) 
t <- t i d + c s ■ At 

fpred «~ fold + U s ■ At ■ k 

Interpolate up to ghost cells on finer levels of f pre d 
Exchange ghost cells on each level of f pre d 
Apply boundary conditions to fpred 
end procedure 
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Algorithm 3 Multi-Stage Right-Hand Side Evaluation and Flux Accumulation 
procedure COMPUTERHS(/ pre(i , s, k, F accum ) 
for all Levels I <— 1, L do 
for all Patches p do 

Fpred <- COMPUTEFLUXES(/ pred ,t) 
end for 

Exchange fluxes between patches on level I 
end for 

for all Levels I <— L, 1 do 
for all Patches p do 

k «— FLUXDlVERGENCE(F pre( i) 
Faccum 4 F accum + b s • Fp re d 
end for 
end for 
end procedure 



Algorithm 4 Multi-Stage, Multi-Level Flux-Divergence Update Computation 
procedure COMPUTEUPDATE(F accum , f new ) 
for all Levels I <— L, 1 do 
for all Patches p do 

(5/ «— FLUXDlVERGENCE(F accum ) 
fnew ^— /oZd + At • 5/ 

Coarsen fluxes down to level I — 1 
end for 
end for 

Coarsen fine data down for f new 
end procedure 



To integrate the Vlasov-Poisson system, the time advancement algorithm must be 
adapted to allow the simultaneous advancement of multiple phase-space hierarchies. In 
addition, the Poisson equation represents an instantaneous constraint, and we chose most 
self-consistent strategy of re-evaluating the Poisson equation at each predictor state. An 
alternative possibility is to extrapolate eft in time to avoid some of the intermediate field 
solves and the associated parallel synchronization; investigating this approach is left for 
future work. 

The associated modifications to Algorithm [T] are shown in Algorithm [5] The main 
differences are that the major steps are each now computed for all hierarchies and that 
additional steps to evaluate instantaneous constraints have been inserted on predicted or 
updated states are obtained. Note that we do not recompute the potential until after any 
possible regridding since the regridding step for the configuration space hierarchy is not 
independent of the phase space hierarchies in our current implementation. More details 



about this are given in Section 3.3 



10 



Algorithm 5 Synchronous, Multi-Stage, Vlasov-Poisson Multi-Hierarchy Advance 
k <- 

for all Stages s <— 1, 4 do 
for all Hierarchies do 

ComputePredictorState(A;, S, f pre d) 
end for 

COMPUTElNSTANTANEOUSCONSTRAINTS(/ pre( i, 0) 
for all Hierarchies H do 

CoMPUTERHS(/ pred , <j>, s, k, F accum ) 
end for 
end for 

for all Hierarchies H do 

COMPUTEUPDATE(F accmn , f new ) 
end for 

if time to regrid then 

Regrid all hierarchies 
end if 

ComputeInstantaneous Constraints (/ neU i, 4>) 
Compute next At 



3.1.2 Conservative, limited, high-order interpolation algorithm 

Fine-patch cells are filled from coarse cells either when new fine-level patches are created or 
when fine-patch ghost cells at coarse-fine interfaces are filled. For second-order discretiza- 
tions of first-order differential operators, slope-limited linear reconstruction is generally used 
to obtain fine-cell averages from the coarse grid while controlling non-physical oscillations. 
To obtain a fourth-order reconstruction while controlling oscillations, several techniques 
exist, including least squares |46j . unfiltered [17], and explicitly-filtered [M| high-order in- 
terpolations. We adopt a slightly different approach and make use of standard WEN05 [30j 
interpolants that have been analytically integrated to obtain explicit cell-average interpo- 
lation formulas. 

We assume cell-averaged data u\ on a coarse mesh with mesh size h and an overlapping 
fine mesh with mesh size h', such that 

/'{ /',/>',. j = l,2,...,A (17) 

where each Rj is a positive integer. Our goal is to construct a high-order approximation to 
fine-mesh cell-averaged values u( such that the integral over the fine mesh exactly equals 
the integral over the coarse mesh. In addition, since initialization of fine mesh from coarse 
mesh may be done in regions of high gradients, we seek an adaptive interpolation scheme 
that will inhibit the creation of unphysical oscillations. 

For our fourth-order discretization, the five-point WEN05 scheme is sufficient. In the 
general approach to obtain an interpolation in cell i, one is given five cell-averages Uj+ e , 
e = — 2, — 1,0, 1,2, and a location Xj_!/ 2 < x < x i+ i/ 2 , as shown in Figure [2j It is useful at 
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Figure 2: Relationship of fine to coarse mesh in global index space with a refinement ratio 
of R = 4. The five coarse cells shown are used to determine the cell averages in the four 
fine cells that subdivided cell i. 



this point to define some auxiliary quantities: 

-A+n = Uj+n+l Ui+ n , 



n 



-2,-1,0,1, 



Aj+p — Di +p — A+p-i, V =—1,0,1. 
The algorithm proceeds for a uniform mesh as follows: 

(r) 

1. Compute smoothness detectors p- , r = 0, 1, 2: 



(18a) 
(18b) 



(°) = 


13 




12 


w = 


13 




12 




13 




12 



i+1 



1 

4 



A? + i (A + A-ir, 

A 2 _ x + 1 (3A-1 " A-2) 



(r) 

2. Compute the absolute interpolation weig hts al ', r = 0,1,2: 

a^=d r /(e + ^f, 



(19a) 
(19b) 
(19c) 

(20) 



where do = 3/10, d\ = 3/5, d,2 = 1/10, and e is a small positive value to avoid division 
by zero (typically e = 10 -6 ); 



(r) 

3. Compute the relative interpolation weights uj\ , r = 0, 1, 2: 

/ 2 \ 



^ r) = «i r) /(E« w 

s=0 



(21) 



4. Compute the interpolants vf\x), r = 0, 1, 2: 

/ 



•«(*) = * £ 



m=l 



m— 1 
j=0 



2 

£ 



fl (x- aJi-i/2 - % - 0) 



11 fc(m-Z) 
z=o 



(22) 
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5. Compute the combined interpolant Vi(x): 



(r) ir), , 



(23) 



r=0 



The result Vi{x) is an interpolant constructed from cell average values such that 

x i+l/2 

j Vi(x) dx = hvn. 

x i-l/2 



(24) 



In smooth regions, Vi(x) is an 0(/i 5 ) approximation pointwise; in regions where under- 
resolution generates oscillations, the scheme drops to 0(/i 3 ) pointwise (at worst) by adapt- 
ing its stencil through the nonlinear weights so as to bias towards interpolations that are 
less oscillatory. 



For adaptive mesh refinement, we can analytically integrate (22) over the fine- mesh cells 



to arrive at simple algebraic equations for the fine- mesh cell- averages. For a refinement 



ratio of R, we integrate (22) over each of the intervals [xj_i/2 + sh/R, Xj_i/2 + (s + l)h/R], 
s = 0, 1, . . ., R - 1, (see Figure [il. Define for r = 0, 1, 2, a\ 
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(25) 



Then the three fine-mesh cell-averaged interpolated values in cell (Ri + s) are 



l Ri+s 



(r) 



3s 2 + 3s + 1 
6^ 



(26) 



for r = 0,1,2. Note that, to ensure exact conservation to round-off, we renormalize the 
average of the fine cells to the original coarse cell average by 



/ 
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+ Ui 



R-l 
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Ri+s 



(r) 



(27) 



this ensures that the truncation errors are equidistributed amoung the sub-cells. 

In implementation, advantage can be made of the many repeated factors. Notably, for 



(r) 

any given coarse cell i, the fifteen auxiliary variables (18) and (25) and the uj\ only need be 
computed once for the R fine cells in cell i. Similarly, for a fixed refinement R, the 2{R — 2) 
functions of s = 0, 1, • • • , R — 2 in (26 ) are the same for any coarse cell i. 

The direct, though not most efficient, extension of the one-dimensional algorithm to 
multiple dimensions is to apply the method dimension-by-dimension. Thus, it is sufficient 
to build code to handle the ID problem, and the multi-dimensionality is handled through 
data management, i.e., the input provided to the ID routines and the memory destinations 
to which the results are written. 

Consider the 2D case where the refinement ratios are Rq and R\ in the xq- and x\- 
directions, respectively. An example with Rq = Ri = R = 4 is shown in Figure [3^a) . In 
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: 



(a) 



(b) 



Figure 3: (a) The twenty- five coarse cells in 2D used to interpolate the sixteen fine-cell 
averages in cell i for a uniform refinement of R = 4. (b) The cells involved in the partial 
interpolation in the x-direction for cell i. The result are the four cell averages that are 
fine in the x-direction but coarse in the y-direction. (c) The cells involved in the partial 
interpolation in the y-direction for sub-cells at fine-grid location Ri. This is repeated for 
all fine-grid locations in the x-direction. 



cell i, we first compute cell averages for cells refined only in x$ using (26). This is shown in 
Figure [3^b). The result for each cell i is Rq new sub-cell values. 

The same operation is then applied in xi-direction, but the input values are no longer 
the coarse-grid averages, but are now the partially-refined averages from the previous step. 
This is shown in Figure [3](c) . The result for each fine cell Ri + s is R\ sub-cell values, and 
since there are Rq xi-interpolations per coarse cell i, RqRi sub-cell values. 



3.2 Inter-hierarchy transfer operations 

Data transfer between hierarchies of different dimensionality requires the formulation of 
special algorithms and auxiliary data structures. While the injection of lower-dimensional 
data in the higher-dimensional space is a straight-forward constant continuation in the 
new dimensions, the reduction of higher-dimensional data into the lower-dimensional space 
requires the application of an operator across the dimensions that are removed, such as the 
integrals in the moment reductions ([3]). 

The application of reductions across an AMR hierarchy is not in itself new. For example, 
the computation of mathematical norms is frequently executed on AMR hierarchies, and 
any norm is the reduction of higher-dimensional data into a scalar value. Lower-dimensional 
slices across a hierarchy are often used for visualization. A special case of such a slice reduc- 
tion was developed for the laser-plasma interaction code ALPS [H], where the paraxial light 
wave sweeps required plasma densities on lower-dimensional planar slices. The challenge for 
the Vlasov system is that the reductions are the result of accumulation. Spatial fidelity and 
accuracy must be maintained while accumulating across the velocity dimensions, and such 
reductions must be done efficiently and without double-counting (recall the overlapping 
mesh hierarchy shown in Figure [TJ. 

In addition to the need to obtain data at equivalent resolution, the act of orchestrating 
a reduction operation across a hierarchy in parallel requires several auxiliary structures. In 
fact, to preserve generality in the mesh refinement, auxiliary data structures are also helpful 
for injection operations. We next discuss two moment reduction algorithms that have been 
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Figure 4: The configuration-space composite grid corresponding to the composite grid Qq 
depicted in Figure [T] after reduction. 

developed for the Valhalla code, followed by a brief description of the associated injection 
algorithm. 

3.2.1 Moment reduction algorithm 

Consider the composite grid depicted in Figure [T] Let us assume that we will accumulate on 
coarse- grid index j for coarse- grid index i = 1. Along i = 1 there are cells of two resolutions 
since there are two cells, j = 7, 8, that have been refined. If we do not preserve the finest 
resolution in the accumulation on j, we will lose some known information about the spatial 
structure in the remaining ^-direction. To preserve the finest-resolution spatial information, 
we should subdivide in the x-direction to obtain cells of uniform resolution (in i). Using 
this principle, the corresponding composite grid after reduction is shown in Figure [4j 

One might consider subdividing coarse cells without spatially reconstructing the data 
within the coarse cell, but this would result in an 0(h) error in the summation. To see this, 
consider that, for a refinement ratio of R, the relationship between a fine grid cell average 
and a coarse grid average in a single dimension is 

u f m+s = Ui - h ( - ^ d x u\ i+ 2^ +0(h 2 ) , (28) 

where s = 0, 1, . . . , R — 1. Thus, to preserve higher-order accuracy, all coarse data must be 
reconstructed before averaging to the finest resolution and before applying the reduction 
operation. 

One can make use of the limited interpolation operators defined in Section |3.1.2| How- 
ever, the data to be reduced should be well-resolved (or else it would have been refined), so a 
less expensive option is to use a linear interpolation. One can construct such an interpolant 
by averaging over sub-cells the fifth-order interpolant, 



with 
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E 7j(?)*ii+i + 0(/» 5 ), 


(29) 


7-2 (v) 


= (5r/ 4 - 20r/ 3 + 15r? 2 + 10r? - 6) /120, 


(30a) 


7-1 (v) 


= (-20r/ 4 + 60r? 3 + 30r/ 2 - 150?? + 54) /120, 


(30b) 


loiv) 


= (30r/ 4 - 60r? 3 - 120r? 2 + 150r/ + 94) /120, 


(30c) 


niv) 


= (-20r/ 4 + 40r? 3 + 90r/ 2 - 10?? - 26) /120, 


(30d) 


72 (??) 


= (5r/ 4 - 15r/ 2 + 4) /120, 


(30e) 
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as was done to arrive at (26). The resuling formula for the fine- mesh interpolations is 
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bj(s)u i+j + 0(h 5 ) 



-J 



u Ri+s 



i=-2 



with 



(31) 



b-2(s) 


= (P4(a) 


- 5p 3 (s) + 5p 2 (s) + 5 Pl (s) - 6) /120, 


(32a) 


b-l(s) 


= (-4p 4 


a) + 15p 3 (s) + 10p2(a) - 75p x (s) + 54) /120, 


(32b) 


bo(s) 


= (6p 4 (s 


) - 15p 3 («) - 40p 2 (s) + 75pi(a) + 94) /120, 


(32c) 


h(s) 


= (-4p 4 


s) + 5p 3 (s) + 30p 2 (s) - 5pi(s) - 26) /120, 


(32d) 


b 2 (s) 


= (P4(a) 


-5p2(s)+4)/120, 


(32e) 



where 

At this point, it is helpful to introduce some notation. We denote an iV-vector of 
integers by i = (io, ■ ■ ■ , ijv— l) £ Z^ and a patch, "P, by a pair of iV-vectors that indicate 
the lower and upper cells of the patch: V = [i\ , ihi]- The restriction from an N- vector to a 
(N — l)-vector by removing the j-th element is restrj, e.g., 

restrj i = (i ,k, ■ ■ . . . ,ijv-l) G Z^ -1 . (34) 

and we define restr^P = [restr^ ii G , vestij i^]. We also define a replacement operator 
replj(a,6) that operates on patches and that replaces the j-th element of the lower and 
upper iV-vectors by a and b, respectively: 

Teplj(a,b)V = [(i ,ii,.. . ,ij-i,a,i j+ i, . . .,i N _i), (i ,ii, ■ ■ ■ , ij-i, b, ij+i, ■ ■ ■ (35) 

A collection of patches at a refinement level I is denoted by L\, and a patch hierarchy Qh 
is defined as the set of refinement levels with an iV-vector refinement ratio defined between 
each level and the coarsest (1 = 0), i.e., 

g H = {C h < I < L - 1 : 3 t\ +1 €Z n , 0<l<L-2}. (36) 

The directional refinement and coarsening operators, lZ°j ,b and C^' b , respectively, refine and 
coarsen the j-th direction of a patch by the ratio defined between refinement levels a and 
b, that is, 

■JlfV = repl, [Rfij , Rf(ij + 1) - \)V, (37a) 
C fv = repl^Cf 'ij, [Cfiij + 1) - 1J )V, (37b) 

where R f = Ulo^h/Uto^ 1 ), and Cf = 1/Rf. 

Again referring to Figure [TJ we wish to compute the reduction on the composite grid 
Qc, but in fact we have the hierarchy grid Qh- The standard technique for computing 
a reduction across a hierarchy without double counting is to use a mask to zero out the 
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contributions from coarse grid regions that are overlapped by fine grid. Let Pi be the 
number of patches in level £/; let Xj' b be the the interpolation operator in direction j that 
refines the local data from the resolution of level a to that of level b; and let (if be the 
masking operator on patch V v that sets the data in cell i to zero if the cell is covered by a 
patch at a finer level. We further assume, without loss of generality, that the phase-space 
integer vector i is ordered in such a way so that elements through N — 1 correspond to 
configuration-space indices, and elements N through N + M — 1 correspond to velocity 
space indices. 

The reduction operation to construct the charge density is 

N+M-Ud-l N+M-l L-lPj-l id,hi N-l 

= E E/' = 1 - y i E EE E *n%-'t- 

d=N j d =o d=N i=o P =o jd= f d io d'=0 

L-lPi-l N+M-l J'd,hi N-l 

-i-*ee E E «; IP;; : * 

z=o P =o d=jv id =^ lo d'=0 

where i c = ridSv^ -1 restr^ i G Z w is the configuration space index and where V v = 
ridSv^ -1 hd- I n words, the distribution function on each patch p is first refined up to 
the finest level in the configuration space directions, then masked, and then accumulated. 
Note that the mask operator and the interpolation operator do not commute; after summing 
over the velocity -space indices, jd, with a mask, the resulting partial sums would in gen- 
eral have discontinuities at mask boundaries. Thus, using a masking procedure, one must 
reconstruct in the higher-dimensional space, which is more expensive that reconstructing in 
the lower- dimensional space. 

To facilitate the inter-dimensional communication, two intermediate data structures are 
used. The partial reduction level is a level of overlapping patches at the finest resolution 
in configuration space, where the patches are projections of every patch in the phase-space 
hierarchy: 

fN-l N+m-1 > 

^partial = I ]J V}^ 1 ]J Te S tl d V, V V € Q H \. (39) 
[d'=0 d=N J 

The total reduction level, Aotab disjoint covering of patches of the configuration space 
domain that, aside from the resolution, is independent of all hierarchies. These structures 
are created given a phase-space hierarchy Qh and configuration space-hierarchy Q C H and 
exist until regridding occurs. 
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Figure 5: Graphical depiction of the Mask Reduction Algorithm. 



Algorithm 6 Mask Reduction 
for all levels C £ Qh do 
for all patches V G C do 

Refine data in non-reduction directions 
Mask out covered cells 

Find patch V P (V) £ ^partial corresponding to patch V 
Sum reduce to partial sum patch V p 
end for 
end for 

for all patches Vt £ Aotal do 

Accumulate data from co-located patches V p 
end for 

for all levels C £ Q C H do 

for all patches V G C do 

Copy data from co-located patches V% 

end for 
end for 



In Figure [5j a diagram of the Mask Reduction Algorithm, Algorithm [6j is presented. 
The algorithm proceeds as follows. For each patch in the phase space hierarchy, the data 
on the patch is first refined in those directions that will remain after the reduction. The 
covered regions are then masked out; this is accomplished by using masks that have been 
precomputed using a modified communication algorithm^] and stored in the phase space 
hierarchy. A one-to-one mapping is used to obtain the configuration-space partial-sum 
patch V p from the pre-computed partial summation level £ pa rtiai that corresponds to the 
current phase space patch V . The summation is then executed, and the results are placed 

2 The mask variable is set to unity everywhere. Then a standard communication algorithm from fine to 
coarse levels is executed on the mask variable in the phase space hierarchy, but instead of copying fine data 
to coarse cells where overlap occurs, zeros are copied into the coarse cells. 
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in the configuration-space partial summation patch V v . Once all patches in the phase space 
hierarchy Qu have been reduced, a communication from the partial summation level £ pa rtiai 
to the total reduction level Aotal is executed using an accumulation operation]^] The total 
reduction level at this point contains the total reduction on a set of disjoint patches at the 
finest resolution. A standard communication operation from the total reduction level Aotal 
to the configuration space hierarchy Qjj completes the data transfer. 

We note that the total reduction level is not necessary. One could communicate directly 
between the partial summation level and the configuration space hierarchy using an accu- 
mulation operation. However, for clarity and ease of implementation, we favored the use of 
an intermediate total reduction level. 

While the Mask Reduction Algorithm is simple to implement, one might suspect that it 
is not as efficient as it could be because interpolation is done in phase space to the original 
data. Instead, consider execution of the reduction on the composite grid, Qc- Let Pq be 
the number of patches in Qc- In a single dimension: 



N+M-Ud-l Pc-1N+M-1 J d,hi N-l 



«c = i-k E £* = 1 -^E E E n^f X 

d=N j d =0 p=0 d=N jd =f dlo d'=0 



Pc-lN-l N+M-l 3d,hi 



k E II E E si m 



p=0 d'=0 d=N 
P C -1 N-l 

- 1 " E n ^ l el 

p=0 d'=0 

where, since the composite grid has no levels, the level index I is a function of the patch 
index p and is merely a label indicating the refinement relative to the coarsest patches. 
Note the savings of the last step; instead of applying the potentially costly prolongation 
operator at every grid level j, it is instead applied to the lower-dimensional partial sums 
on each patch. To achieve this simplification, however, an efficient algorithm is needed to 
construct the composite grid. 



3 Instead of copying values from each source patch to each destination patch, values are added from each 
source patch to each destination patch. 
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Algorithm 7 Subdivision of patch into composite grid sub-patches 
procedure COMPUTESuBPATCHES('Pi n , £j n , Qh) 
for all levels C G Gh ■ £ > An do 

for all patches V G £ : P n P in / do 

Extend <" HIT" 1 ^Pl rf ((ijS)d, ^ 

for all patches V s G S do 

5 <- 5 - V s + V s n Extend + Vs\(V s D ^extend) 

end for 
end for 
end for 
return 5 
end procedure 



Such a procedure based on basic box calculus operations is presented in Algorithm [7J 
Given a patch V{ n and the hierarchy in which it resides, a set of sub-patches S is to be 
constructed. Initially, the set of sub-patches is just the original patch V m . All patches V 
from finer levels that overlap the patch are found. In the SAMRAI library, these relation- 
ships are already known and can be obtained directly without searching. Each overlapping 
patch, then, is extended in the reduction directions, for example, that is, its lower and upper 
indices in the reduction directions are replaced by the lower and upper limits of the input 
patch Vm". 

N+M-l 

V extend = J] repl d ((t)d,($a)d)V. (41) 

d=N 

The extended overlapping patch 7-extend is then intersected with each patch V s G S, and 
both the intersections and the complements replace the patch V s in the set. The original 
overlap patch V is then subtracted from the set S. The extension ensures that sub-patches 
of the greatest extent in the reduction directions can be formed and that subsequent removal 
of overlap patches results in rectangular sub-domains. 
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Figure 6: Graphical depiction of the Sub-Patch Reduction Algorithm. It is those steps in 
the light purple box that differ from the Mask Reduction Algorithm. 



Algorithm 8 Sub-Patch Reduction 
for all levels C G Qh do 
for all patches V G C do 

Find patch V P (V) G £ pa rtiai corresponding to patch V 
S «- ComputeSubPatches^A^h) 
for all sub-patches V s G S do 

Grow ghost cells in non-reduction directions 
Sum reduce data, including ghost cells, to temporary patch Ptmp 
Refine data on T^mp in non-reduction directions 
Copy data on interior of Ptmp to partial sum patch V p 
end for 
end for 
end for 

for all patches Vt G Aotal do 

Accumulate data from co-located patches V v 
end for 

for all levels C G Q C H do 

for all patches V G C do 

Copy from co-located patches Vt 

end for 
end for 



Using this sub-patch construction procedure, the more efficient Sub-Patch Reduction 
Algorithm presented in Algorithm [8] and depicted in Figure [6] can be used. As before, 
loops are performed over all patches, but now, for a given patch, the set of sub-patches 
are identified. Each of these sub-patches is first grown in the non-reduction directions by 
the number of ghost cells necessary for any subsequent prolongation operations. The data 
on each sub-patch, including the ghost cells, is sum reduced to a temporary configuration- 
space patch of the same resolution. This partial sum data is then refined, and the result is 
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Figure 7: Injection 

copied into the corresponding partial sum patch from the partial sum hierarchy. Once all 
contributions from all sub-patches are obtained, the reduction algorithm proceeds as before. 

Finally, we note that neither reduction algorithm assumes that either the phase-space or 
configuration-space hierarchies have a special structure. Because intermediate data struc- 
tures are used along with standard communication algorithms, arbitrary meshes could be 
used in either dimensions. In addition, these algorithms are applicable in parallel and for 
arbitrary dimension. 

3.2.2 Injection algorithm 

The process of injection from lower to higher dimensions is much simpler. The data, Ei, for 
instance, is the same for all phase space locations at index i, that is, E{j = Ei. Nevertheless, 
to facilitate the data transfer between hierarchies of different dimensions, it is convenient 
to first construct an intermediate configuration-space restricted hierarchy, Q r H = {£[,0 < 
I < L — 1}, where 

{N+M-l } 
11 restr d P, VP € Q H \, (42) 
d=N J 

that is, it is composed of lower-dimensional restrictions of all of the patches in the phase 
space hierarchy Qh- 
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Algorithm 9 Injection 
for all Levels C G Q' H do 
for all Patches p £ C do 

copy from co-located patches p c G ^ 
end for 
end for 

for all Levels C G Qh do 
for all Patches p G £ do 

find patch p r (p) G corresponding to patch p 
either copy from p r into p or use directly 
end for 
end for 



For completeness, the injection transfer algorithm is depicted in Figure [7] and presented 
in Algorithm [9| Standard communication copiers are used to fill the restricted hierarchy 
Q r H from the configuration space hierarchy Q C H . A one-to-one mapping exists from every 
restricted hierarchy patch to the corresponding phase space hierarchy patch. The restricted 
hierarchy data can then be injected into phase space data, e.g., Eij <— E; L . This wastes stor- 
age with many repeated values, so in the Valhalla code, we directly access the restricted 
hierarchy data when needed. 

3.3 Regridding for multiple AMR hierarcies 

Working with multiple hierarchies introduces regridding challenges for AMR algorithms. 
With a single hierarchy, the user typically defines a regrid frequency. At user-defined times, 
refinement criteria are used to identify cells in need of refinement (coarsening) , new levels of 
rectangular patches are formed containing these flagged cells and are populated, and these 
new levels replace old levels in the hierarchy. With multiple hierarchies, one must decide 
to what degree to constrain the regridding of each hierarchy. Considerations include the 
facilitation of efficient communication between hierarchies, the cost/benefit of adaptation 
for each hierarchy, and the degree and nature of dependencies between hierarchies (e.g., 
can the hierarchies refine independently, and if not, are the dependencies one-way or more 
complicated?) In the case of Vlasov simulation, coordination is most critical between the 
configuration space hierarchy and the phase space hierarchies, where a variety of interme- 
diate data structures are required to execute inter-dimensional data transfer. 

For the purposes of demonstrating proof-of-principle, we made several simplifying choices. 
For 1D+1V Vlasov-Poisson, the electrostatic problem is solved in ID. When restricted down 
to ID, mesh refinement of features in 2D, such as particle trapping regions, will typically lead 
to mesh refinement almost everywhere; hence, there is little advantage to mesh refinement in 
configuration space. Furthermore, the cost of the higher dimensional solve by far dominates 
the total cost, so there should be little advantage to using mesh refinement to speed-up the 
lower-dimensional solve. We therefore elected to require the configuration space mesh to be 
uniform at the finest level of refinement of the phase space hierarchy. While the mesh was 
decomposed into patches, we did not distribute the configuration space hierarchy across 
multiple processors. For higher-dimensional problems, such as 2D+2V Vlasov-Poisson, one 
may benefit from distributing the configuration-space hierarchy. However, such a distri- 
bution will be over a much smaller number of processors than the phase-space hierarchy 
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Figure 8: Unified Modeling Language depiction of the Observer Design Pattern used 
to create notifying hierarchies. In our case, the reduction and injection algorithms for 
configuration-space are HierarchyObservers. These observers register themselves with the 
phase space Notif yingHierarchy to receive notices about regridding. 



simply because there is so much less data and work to distribute in lower dimensions]^] 
When the phase-space hierarchy was regridded, the configuration-space hierarchy was only 
regridded if a new level of refinement was added (removed) in phase space. This scheme 
had a secondary advantage of simplifying the Poisson solver; solves were executed on the 
uniform, finest mesh level in configuration space and then averaged down to coarser levels, 
thereby avoiding the need for a Fast Adaptive Composite (FAC) iterative algorithm [49J. We 
note that these are merely choices and not requirements; the algorithms for inter-hierarchy 
data transfers defined in Section 3.2 support more general configuration and phase-space 
hierarchies. 



3.3.1 Managing inter-hierarchy coordination 



From the descriptions of the reduction and injection transfer algorithms in Section 3.2, it is 
clear that the intermediate data structures, such as the partial sum level or restricted patch 
hierarchy, are dependent on the phase and configuration space hierarchies. When regridding 
of any of the primary hierarchies occurs, the intermediate data structures must be rebuilt 
in order to maintain consistency. To facilitate this, we made use of the Observer Design 
Pattern [50] as depicted in Figure [8] The SAMRAI concept of PatchHierarchy was gener- 
alized to allow other objects, such as the ReductionAlgorithm and InjectionAlgorithm, 
to subscribe to the phase space hierarchy in order to receive messages indicating that the 
phase space hierarchy had regridded. Reconstruction of the intermediate data structures 
is deferred until a subsequent reduction or injection operation is attempted and a message 
from an observed hierarchy is found. 



3.3.2 Mesh Refinement Criteria 

Finally, selection of mesh refinement criteria can be critical in obtaining optimum AMR 
performance. For our purposes here, we chose to apply common heuristic refinement criteria 

4 For a uniform-grid 2D+2V Vlasov-Poisson code, we have seen in practice that the Poisson solve benefits 
from distributed parallelism only when the problem size has grown such that the Vlasov solve occurs across 
several thousand processors. 
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to the phase space distribution function. Specifically, we tag cells when 



Sifi + <5 2 /i > tol 



(43) 



where 



2 



and <5 2 /i = - ^ Aa^|/ i+e d - 2fi + /j 



i-e 



(44) 



d=l 



estimate the first two truncation error terms. We do not claim that this is the optimal 
choice; it is merely sufficient to demonstrate our algorithms. Other error indicators could 
be used, including indicators based on physical principles, such as local estimates of the 
location of the trapped-passing boundary. The choice of optimal refinement criteria is 
intimately related to problem-specific quantities of interest, so we leave this topic for future 
work. 

4 Numerical Results 

We present results from a Valhalla simulation of the bump-on-tail instability |51| §9.4] 
as a basic proof-of-principle of the block-structured AMR approach for Vlasov-Poisson sim- 
ulation. We used the same problem specified in our previous discretization work [39J. The 
initial distribution function was given by 



The (x,v) domain was [— 107r/3, 107t/3] x [—8, 10] and was periodic in the ^-direction. We 
initialized the solution with a coarse grid of N x x N v = 16 x 32 and with an initial refinement 
in the box [(0, 8), (15, 24)]. This initial mesh configuration allowed for larger time steps, 
since the cells along the maximum velocity boundary have a larger aspect ratio. The initial 
time step was Ato = 0.01, and this was allowed to adjust to 50% of the local stability 
condition based on the linear stability of the fourth-order scheme (See p38j). Time steps 
could increase no more than 10% from their previous value, but could decrease by any 
amount. The grid refinement criteria tolerance was tol = 0.01, and grid refinement ratios 
of rj = [2,4], r\ = [4,2], and r| = [2,2] were used. Up to four levels of AMR mesh were 
allowed. To isolate the AMR performance issues, we consider the serial performance on a 
single node of the LLNL 64-bit AMD Linux cluster hera. 

AMR performance is very problem-dependent. When small regions of refinement are 
required, in particular, when there are lower-dimensional features in the solution, AMR is 
generally a net win. However, there is overhead associated with AMR for which sufficient 
problem size reduction is necessary to achieve a net gain in simulation performance. Per- 
formance is also highly dependent on the choice of parameters, such as regrid frequency 
and refinement tolerances, so the results presented here are meant to demonstrate that our 
Vlasov-AMR procedure works and can show savings. Whether or not AMR is useful in 



f = f b (v) (1 + 0.04 cos (0.3s)), 



(45) 



with 




(46) 
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Parameter 


AMR1 


AMR2 


AMR3 


~\ arcrp q1~ tip 1~ fVi qi 70 








level.O 


(32,32) 


(16,32) 


(16,32) 


level_l 


(64,64) 


(32,128) 


(32,128) 


level_2 


(64,64) 


(128,256) 


(128,256) 


smallest _patch_size 


(4,4) 


(8,8) 


(8,8) 


regrid_interval 


2 


4 


8 


tag.buf f er 


(1,1) 


(4,4) 


(8,8) 



Table 1: AMR parameters used to define the three test configurations. The parameters 
largest_patch_size and smallest_patch_size control the largest and smallest allowable 
patch sizes level-by-level; if unspecified, the finest specified value is applied to all subsequent 
levels. The parameter regrid_interval is the frequency, in time steps, at which regridding 
occurs. Finally, tag_buf f er is the number of buffers cells to add around a region tagged 
for refinement to facilitate less frequent regridding. 



other specific cases and optimal choices for AMR parameters and regridding criteria are 
very important issues. 

To help elucidate the AMR performance, we considered three AMR parameter config- 
urations, as shown in Table |4j The AMR1 case represents an attempt to minimize the 
number of refined cells by using smaller patches and more frequent regridding. The AMR2 
and AMR3 cases allow for larger patches an less frequent regridding. Thus, these three cases 
can give some sense of the trade-offs between reducing the amount of mesh (AMR mem- 
ory reduction) and reducing the run time (AMR speed-up). All cases were run using the 
Sub-Patch Reduction algorithm with unlimited fifth-order reconstruction unless otherwise 
noted. 

Figure [9] shows computed approximations of the phase-space distribution function at 
t = 22.5 for case AMR1. As expected, we see a concentration of the mesh only in regions 
of most rapid variation in the solution, and conversely, we see mesh coarsening in the 
the trapping region. At this point in the calculation, the total number of cells is 40784, 
compared to 131072 cells in the equivalent fine grid - a reduction of approximately 69%. 
In Figure [TO] , we show the time history of the number of AMR cells plotted against the 
instantaneous equivalent fine grid for each AMR parameter configuration. We see that once 
adaptivity starts, we can achieve an average reduction of between forty and sixty percent, 
depending on the AMR parameters. As expected, the AMR1 case uses the least number 
of cells, and the AMR3 case, because of its increased patch size and tagging buffer, uses 
the most cells. Considering Figure [9| we see that a lot of the mesh reduction comes in the 
velocity (vertical) dimension, and this is expected for each additional velocity dimension in 
higher dimensions. In 1D+1V, there is little localization in configuration space. However, 
in 2D+2V, there is also the opportunity for spatial localization of the features, which would 
result in even more mesh reduction. 

In Figure 1 1 , we present the time history of the maximum of the electric field for the 
bump-on-tail problem for three different resolutions. This metric is a fairly sensitive measure 
of numerical fidelity. In addition to the three AMR parameter cases, we plot the results 
from three uniform-grid cases: 64 x 128, 128 x 256, and 256 x 512. We use the finest of 
these as a "reference" solution to plot the discrepancy of the electric field maximum. For 
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Figure 9: Example result for the bump-on-tail problem at time t = 22.5 for the AMR1 
case. On the left, the distribution function is shown. Boxes outlined in white denote AMR 
patches. On the right, the corresponding four- level AMR mesh is shown. The mesh adapts 
to resolve the particle trapping region as it forms. Note that the minimum distribution 
function is small but negative; no positivity enforcement schemes were used in this calcula- 
tion. 

the AMR calculations, the finest local resolution is equivalent to the 128 x 256 uniform 
mesh. At early times, when the solution has little structure, all of the solutions agree 
well. The small up/down differences in the AMR results before t = 10 are due to the 
discrete temporal resolution (the AMR cases use larger time steps) of the first two valleys 
of the maximum electric field. Around t = 25, one begins to see significant differences in 
the coarsest solution, since it cannot resolve as well the features being generated in the 
particle trapping region. We can conclude from these results that the 64 x 128 resolution 
was insufficient to accurately track the maximum electric field over this time interval; thus 
the increased resolution of the AMR is necessary. 

By about t = 50, one sees a growing discrepancy between all of the AMR cases and the 
equivalent uniform mesh of 128x256; over the interval considered, the discrepancy is roughly 
twice as large at its maximum. One explanation for this could be the accumulation of error 
over longer integration times. Another likely explanation is that we are not capturing 
all of the relevant details with the refined mesh because we are using a simple heuristic 
gradient detection algorithm; more problem-specific refinement criteria may perform better. 
Nevertheless, the AMR results do track the equivalent uniform mesh results well. Compared 
to the finest uniform grid results, the phase of the AMR results is relatively good, but 
the amplitude is being under-predicted by an increasing amount over time; there will, of 
course, be slightly more dissipation in the coarser results when features appear that cannot 
be adequately resolved. These results show that AMR can provide effectively equivalent 
results as a uniform mesh. Of course, one must consider the quantities of interest for the 
calculation, and suitable choices of AMR parameter and refinement criteria need to be 
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Figure 10: Time history of the number of cells for the bump-on-tail simulation for the three 
AMR parameter configurations. The dashed curves are the number of cells in an equivalent 
uniform grid based on the current maximum refinement level, while the solid curves are the 
actual number of cells in the AMR hierarchy. Note that around t = 3, 6 and 9, there is some 
intermittency in all cases, as the adaptivity adds and removes a small number of patches 
at the next finer level. 

selected. 

In addition to mesh reduction, the potential for decreased time-to-solution by using 
AMR is also of interest. As indicated earlier, AMR should have the benefit that the equa- 
tions are integrated on fewer cells and that larger time steps can be taken. However, 
traditional AMR incurs additional overhead from the regridding and the data transfers 
(communication, interpolation, and averaging) between patches on different levels. The 
Vlasov-Poisson system has additional overhead due to the reduction and injection opera- 
tions between different dimensions. 

In Table |4j we provide run times on the hera cluster for the three AMR parameter 
cases in comparison to the run time for the equivalent uniform mesh. The AMR1 case, with 



Case 


Time 1 (s) 


Time 2 (s) 


Time 3 (s) 


Avg Time (s) 


Speed-up 


Uniform 


1521 


1520 


1519 


1520 


1.00 


AMR1 


2622 


2615 


2632 


2623 


0.58 


AMR2 


1355 


1346 


1336 


1346 


1.13 


AMR3 


960 


961 


960 


960 


1.58 



Table 2: AMR speed-up on the hera cluster for the three AMR parameter cases. The 
average of three results for each case are compared to the average run time for three instances 
of the same problem solved using the equivalent uniform mesh. 
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Figure 11: Time history of the maximum electric field for bump-on-tail calculations at 
several resolutions. Results are from the three AMR configurations in which the finest 
resolution is equivalent to a 128 x 256 uniform mesh as well as reference uniform-grid 
calculations on meshes of 64 x 128, 128 x 256, and 256 x 512. The left plot shows the 
maximum of the electric field. The right plot is the same data, plotted as the difference 
from the results from the fine 256 x 512 mesh. 



refinement every other step, causes significant slow-down of the code; however, in the other 
two cases, the time to solution is reduced. As expected, when the regridding frequency is 
reduced, the speed-up increases. We note that we have erred conservatively in favor of the 
uniform mesh solution; running serially and with a single patch, it incurs no inter-patch 
communication costs. 

The reasons for the slow-down are obvious when looking at the cost of certain key 
operations, as shown in Table |4j Note that more than half the time of the uniform grid 
calculation is the routine computeRHSO . In contrast, the amount of time in this routine is 
significantly reduced for all AMR calculations, as one would expect, since this routine will 
scale with the number of cells. That the AMR time in computeRHSO is no more than 24% 
of the total time suggests that the AMR cases are spending a great deal of time in AMR 
overhead. 

One obvious source of overhead is regridding, the cost for which is accounted in regridHierarchie 
As expected, we see that the AMR1 case spends the most time in regridding while the AMR3 
case spends the least. The absolute total time spent in AMR1 regridding is more than ninety 
percent of the total time the uniform mesh case spends in computeRHSO. However, for the 
AMR3 case, which regrids on every eighth step, the regridding cost is much more reasonable. 

The f illDataO routine is the top level routine that applies boundary conditions (hence 
the non-zero cost even for the uniform grid case), fills patch ghost cells on the interior of 
the domain, and fills new patches from patches at the same (i.e., copy) or coarser (i.e., 
interpolate) mesh levels. While fillDataO accounts for only 6% of the uniform mesh 
calculation total time, it represents 40-60% of the total time for the AMR calculations. 
One of the routines that constitutes a significant portion of f illDataO is the finite volume 
WENO refinement in WEN0_2D(); it can be seen that with less frequent refinement, less 
time is spent in this routine. Note that, while the absolute time in WENO interpolation 
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Case 


Exclusive Time (s) 


Total Time (s) 


MultiStagelntegrator : : computeRHSQ 


Uniform 


830.6 (55%) 


830.7 (55%) 


AMR1 


169.0 (6.4%) 


169.1 (6.4%) 


AMR2 


170.5 (13%) 


170.6 (13%) 


AMR3 


236.0 (25%) 


236.0 (25%) 


MultiStagelntegrator : : evaluatelnstantaneousConstraints () 


Uniform 


16.40 (1.1%) 


115.0 (7.6%) 


AMR1 


68.89 (2.6%) 


272.6 (10%) 


AMR2 


19.63 (1.5%) 


132.5 (9.9%) 


AMR3 


12.35 (1.3%) 


89.76 (9.4%) 


xfer: : Ref ineSchedule : :fillData() 


Uniform 


32.25 (2.1%) 


94.65 (6.2%) 


AMR1 


21.95 (0.83%) 


1564. (59%) 


AMR2 


14.54 (1.1%) 


751.9 (56%) 


AMR3 


14.00 (1.5%) 


384.3 (40%) 


ReductionAlgorithm: :reduce() 


Uniform 


0.8683 (0.06%) 


76.56 (5.0%) 


AMR1 


7.507 (0.29%) 


153.5 (5.8%) 


AMR2 


3.179 (0.24%) 


74.75 (5.6%) 


AMR3 


1.846 (0.19%) 


43.31 (4.5%) 


MultiStagelntegrator : : regridHierarchies () 


Uniform 


0.000 (0%) 


0.000 (0%) 


AMR1 


72.70 (2.8%) 


767.9 (29%) 


AMR2 


17.48 (1.3%) 


203.6 (15%) 


AMR3 


6.150 (0.64%) 


74.41 (7.8%) 


ConservativeWENORef ine: :WENCL2D 


Uniform 


0.000 (0%) 


0.000 (0%) 


AMR1 


639.0 (24%) 


720.2 (27%) 


AMR2 


382.2 (29%) 


411.1 (31%) 


AMR3 


203.7 (21%) 


215.1 (22%) 



Table 3: A summary of the costs of six key routines for the three AMR cases and the 
equivalent uniform mesh. Exclusive time is the time strictly spent in a routine. Total time 
is the time spent in a routine and all subroutines called from that routine. 
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Routine 


Exclusive Time (s) 


Total Time (s) 


ConservativeHighOrderRef ine: :WEN0_2D() 


Mask Reduction 


742.6 (27%) 


824.3 (30%) 


Sub-Patch Reduction 


642.5 (24%) 


727.5 (27%) 


ReductionAlgorithm: : reduce () 


Mask Reduction 


27.14 (0.99%) 


245.7 (9.0%) 


Sub-Patch Reduction 


7.505 (0.29%) 


157.9 (5.9%) 


ConservativeHighOrderRef ine: :WEN0_1D() 


Mask Reduction 


0.00 (0%) 


0.00 (0%) 


Sub-Patch Reduction 


5.399 (0.20%) 


10.88 (0.41%) 



Table 4: Comparison of the processing time of the Mask Reduction and Sub-Patch Reduc- 
tion algorithms. Exclusive time is the time strictly spent in a routine. Total time is the 
time spent in a routine and all subroutines called from that routine. 



decrease monotonically from AMR1 to AMR3, the relative times peak with AMR2; this 
is a trade-off between more mesh and less frequent regridding. Furthermore, we note that 
the cost of limited, high-order interpolation for the intra-hierarchy interpolations necessary 
in AMR is not a cost specific to the Vlasov-Poisson system; an AMR method based on a 
higher-order discretization for any PDE system will need to address the efficiency of such 
interpolations. 

The other two routines shared by all four cases, evaluatelnstantaneousConstraintsQ 
and reduce (), are provided to show that these operations spend roughly the same percent- 
age of time whether for the uniform mesh or for the AMR cases. Note that evaluatelnstantaneousConstraint 
is marginally more expensive for all AMR cases because this routine includes the Poisson 
solve, which requires more complicated reductions across the AMR hierarchies. The AMR1 
case is more expensive in absolute time because it has more patches to reduce. However, 
not that the reductions and constraint evaluations for AMR2 and AMR3 are about the 
same cost as or even cheaper than (in absolute time) the uniform case. 

Finally, in Table |4j we present some timings for the routines related to the two re- 
duction algorithms described in Section 3J2 Note that these results were computed using 
WENO interpolation and the AMR1 parameters. For the Mask Reduction, we note that 
the WENCL2D routine is called for intra-hierarchy regridding and communication and inter- 
hierarchy reduction calls. With the Mask Reduction Algorithm, a great deal of time is 
spent in the 2D interpolation routine, and the reductions account for 9% of the total run 
time. For the Sub-Patch Reduction, the WEND_2D routine is not called, so its reported 
time is strictly from intra-hierarchy regridding and communication calls. With the Sub- 
Patch Reduction Algorithm, the time spent in the WENCL2D routine is reduced by roughly 
10%, and it is replaced by about 1.3% of additional work in the WENCL1D routine. The 
total cost of the Sub-Patch Reduction Algorithm is 64% of the Mask Reduction Algorithm. 
The comparative performance is what was anticipated, although, admittedly, for 1D+1V 
Vlasov-Poisson, the savings are not dramatic. Nevertheless, in higher dimensions, there will 
be a more significant benefit; for 2D+2V Vlasov-Poisson, the Mask Reduction Algorithm 
will require four interpolations in each cell in the four-dimensional mesh (scaling like A 4 ), 
whereas the Sub-Patch Reduction Algorithm will require only two interpolations in cell in 
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a two-dimensional mesh (scaling like iV 2 ). 

5 Conclusions 

We have demonstrated the application of block structured adaptive mesh refinement to 
the 1D+1V Vlasov-Poisson system as implemented in the Valhalla code based on the 
SAMRAI AMR library. The primary complication comes from a solution state comprised 
of variables of different dimensions. The considerations and algorithms required to extend 
standard single-dimensional block structured AMR have been presented. In particular, algo- 
rithms for reduction and injection operations that transfer data between mesh hierarchies of 
different dimensions were explained in detail. In addition, modifications to the basic AMR 
algorithm due to our use of high-order spatial and temporal discretizations were presented. 
Preliminary results for a standard Vlasov-Poisson test problem were presented, and these 
results indicate that there is potential for savings, both in memory and in compute time, 
for at least some Vlasov problems. The effectiveness for any particular problem will depend 
intimately on the features of the solution. 

There are several obvious directions for future work. Currently, we are working on 
generalizing the Valhalla code to 2D+2V and higher dimensions. The SAMRAI library 
is quite general and supports arbitrary dimension. Moving to 4D calculations and beyond 
opens up several new directions for investigation. When the configuration space problem 
is in 2D or 3D, there is potential for savings from adaptivity in configuration space. It 
is straightforward to enable this generalization, but it is unclear if the additional cost 
of the necessary FAC algorithm and of the AMR overhead will justify the complication, 
particularly when the solution time will be dominated by operations in the 4D or higher 
phase space. With larger phase-space problems, an efficient parallel decomposition will 
be necessary; we have already indicated potential advantages of providing each species a 
distinct subset of processors, but empirical results are needed. It also will prove beneficial to 
allow for asynchronous time stepping in the AMR advancement; an example of the necessary 
modifications to the time advancement algorithm has been shown [46] . 

Dimensions above three also require additional empirical investigation for AMR effi- 
ciency. As indicated earlier, the potential savings from AMR increases geometrically with 
dimension, but AMR overhead, much of which scales with the number of cells at coarse-fine 
boundaries, also increases with dimension. Whether the overhead costs in 4D and above 
negate the savings remains an open issue and must be the subject of future studies. 

AMR overhead, even in lower dimensions, still requires further reduction. One clear path 
is to make use of hybrid parallelism as multicore architectures become more prevalent. Much 
of the computations contributing to AMR overhead are sequentially executed on a node but 
are completely independent and thus ideal for task parallelism. Implementations should also 
be found that further optimize the conservative, high-order intra-hierarchy interpolations. 

Finally, the topic of refinement criteria requires further investigation. Heuristic or a 
posteriori error indicators that seek to minimize local error are sufficient but not optimal, 
depending on the desired results of the calculation. Most quantities of interest exist in 
configuration space, that is, the macroscopic quantities like temperature and density that 
can be readily measured in the laboratory. The reductions leading to configuration space 
quantities integrate out much of the finer details in phase space, which suggests that it may 
be inefficient to resolve all of the finer phase-space structure. Future investigations should 
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consider (i) the phase space resolution requirements to obtain accurate configuration space 
quantities of interest and (ii) whether more efficient phase-space refinement criteria can be 
formulated based on these configuration-space accuracy requirements. 
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